Exploratory analysis of Perceptions about Science and Open Science

This analysis revolves around the undergraduate thesis carried out by Franco Sebastián Benítez, under the supervision of Débora Burin and Lucas Cuenya, from the School of Psychology of the University of Buenos Aires.

As set out in our preregistration, we examine the following aspects:

1) Check for exclusion criteria in the demographic data, and in the completion rate.

2) Describe the sample’s demographic characteristics.

3) Analyse the total percentage of “yes” responses to belief in crisis. Analyse as a function of career stage and methodological approach.

4) Qualitative analysis of open field response to belief in crisis.

5) Percentage of agreement with each, and combined, statements about replication crisis, p-value, publication bias. Analyse as a function of career stage and methodological approach.

6) Percentage of agreement with each, and combined, statements about perceived barriers. Analyse as a function of career stage and methodological approach.

7) Percentage of agreement with each, and combined, statements about attitudes against adopting open science practices. Analyse as a function of career stage and methodological approach.

8) Qualitative analysis of open field response to attitudes about barriers against adopting open science practices.

Loading the necessary libraries

Loading the dataset

As we see, the data contains 95 rows and 53 columns.

Now, let's rename the columns to make them easier to manipulate and plot.

Let's check all the types.

Let's check all the renamings.

Analysis

1) Check for exclusion criteria in the demographic data, and in the completion rate

To be considered "researchers", respondents must have either participated in a research project or published in a scientific journal, in both cases within the last five years.

As we can see, we now have five fewer participants (rows).
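The exclusion rule described above can be sketched as a boolean filter. This is only an illustration on toy data: the column names `participated_in_project` and `published_in_journal` and the "Sí"/"No" coding are assumptions, not the actual survey columns.

```python
import pandas as pd

# Toy data; column names and answer coding are hypothetical.
df = pd.DataFrame({
    "participated_in_project": ["Sí", "No", "No", "Sí"],
    "published_in_journal": ["No", "No", "Sí", "Sí"],
})

# A respondent counts as a researcher if at least one criterion is met.
is_researcher = (
    (df["participated_in_project"] == "Sí")
    | (df["published_in_journal"] == "Sí")
)
researchers = df[is_researcher].reset_index(drop=True)

# Number of rows dropped by the exclusion criteria.
n_excluded = len(df) - len(researchers)
print(n_excluded)
```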

2) Describe the sample’s demographic characteristics

2.1) Education

Let's look at the data.

And we plot them directly, creating a function that we are going to reuse later.

Let's clean the data a little. We will group the categories into four broad groups: "Doctorado", "Licenciatura", "Especialización", and "Maestría", regardless of whether the degree is ongoing or completed. In addition, since we only have one "Postdoctorado", it will be placed under "Doctorado".

First, let's split the different categories on semicolons. To do that, we create a new dataframe specific to our variable. It contains more rows than the original because each row holding several semicolon-separated values is expanded into one row per value. In this case, the dataframe will be called "education_df".

Note: We also include the "belief" variable because we are going to use it later.
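One way to expand semicolon-separated answers into one row per value is to split the string and then call `explode`. The sketch below uses toy data, and the column names "education" and "belief" and the raw answer strings are assumptions.

```python
import pandas as pd

# Toy data; the raw answer strings are hypothetical.
df = pd.DataFrame({
    "education": ["Doctorado en curso;Licenciatura completa", "Maestría en curso"],
    "belief": ["Sí", "No"],
})

education_df = (
    df[["education", "belief"]]
    # Turn each answer into a list of categories.
    .assign(education=lambda d: d["education"].str.split(";"))
    # Emit one row per list element, repeating the "belief" value.
    .explode("education")
    .reset_index(drop=True)
)
print(education_df)
```

Because `explode` repeats the other columns for each new row, the "belief" value stays attached to every category a respondent selected.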

This gives us the following new dataframe:

Now, let's replace the values, creating a new function that we will reuse later.
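The grouping into four categories could look roughly like this. The helper `group_education` and the raw answer strings are hypothetical, and this is not necessarily how the notebook's own replacement function works.

```python
import pandas as pd

# The four target groups named in the text.
GROUPS = ["Doctorado", "Licenciatura", "Especialización", "Maestría"]

def group_education(value):
    """Map a raw answer to one of the four groups (hypothetical helper)."""
    # The single postdoctoral answer is folded into "Doctorado".
    if value.startswith("Postdoctorado"):
        return "Doctorado"
    for group in GROUPS:
        if value.startswith(group):
            return group
    # Leave anything unrecognised untouched.
    return value

raw = pd.Series([
    "Doctorado en curso",
    "Postdoctorado completo",
    "Licenciatura completa",
    "Maestría en curso",
])
grouped = raw.map(group_education)
print(grouped.value_counts())
```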

And we plot, this time vertically, creating another function for later use.

Thus, most respondents have or are pursuing a doctoral degree, followed by those who have or are pursuing a licentiate degree.

2.2) Research area

First, we create a function that allows us to: 1) remove accents and uppercase letters, 2) join the text from all rows in a series, and 3) drop any NAs.
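The three steps listed above can be sketched with the standard library alone. The function names below are hypothetical, and the sketch uses a plain list with `None` for missing answers; with a pandas series one would typically call `dropna()` first.

```python
import unicodedata

def normalize_text(text):
    """Lowercase the text and strip accents."""
    # NFKD splits accented characters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def join_answers(answers):
    """Drop missing answers and join the rest into one normalized string."""
    return " ".join(normalize_text(a) for a in answers if a is not None)

text = join_answers(["Psicología Clínica", None, "Neurociencia"])
print(text)
```

Note that this approach also turns "ñ" into "n", which may or may not be desirable for Spanish text.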

Now, we create a function to plot. (We are going to reuse this function later.)

Finally, we extract the text from a series and plot it.

We run it again, passing a list of stopwords as an argument.
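To illustrate what the stopword list changes, here is a sketch that counts word frequencies (the quantity a word cloud visualises) while skipping stopwords. The function name, the example text, and the stopword list are all hypothetical.

```python
from collections import Counter

def word_frequencies(text, stopwords=()):
    """Count the words of a normalized string, skipping stopwords."""
    stop = set(stopwords)
    return Counter(word for word in text.split() if word not in stop)

# Toy input, already lowercased and without accents.
text = "psicologia clinica psicologia social psicologia de la salud"
freqs = word_frequencies(text, stopwords=["psicologia", "de", "la"])
print(freqs.most_common())
```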

We can see that neuroscience, neuropsychology, social psychology ("social"), developmental psychology ("del desarrollo"), clinical psychology ("clinica"), and health psychology ("salud") seem to be the most frequent areas in our sample.

2.3) Position

Let's reuse the function to create a horizontal bar plot.

As before, we store the new values in a new dataframe. In this case, it will be called "position_df".

We plot the data.

Better, but it still does not look very good.

We group the values that occur only once into a category we will call "Other".
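A possible way to do this grouping, assuming it targets values that appear only once in the column, is to count the values and replace the rare ones. The position labels below are toy data.

```python
import pandas as pd

# Toy data; the position labels are hypothetical.
positions = pd.Series(["ATP", "ATP", "JTP", "JTP", "Adjunto", "Titular"])

# Find the labels that appear exactly once.
counts = positions.value_counts()
rare = counts[counts == 1].index

# Keep frequent labels and replace the rare ones with "Other".
positions_grouped = positions.where(~positions.isin(rare), "Other")
print(positions_grouped.value_counts())
```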

We plot the data again.

We can see that most respondents hold the role of "Ayudante de Trabajos Prácticos", abbreviated "ATP".

2.4) Methodology

Now, let's see the type of methodology that predominates in our sample of researchers. We plot the methodology series from our main dataframe directly.

As can be seen, most respondents report using a predominantly quantitative approach, while approximately a quarter use a strictly qualitative methodology.

2.5) Age

For respondents' age, we first compute summary statistics and then plot the data directly.

As we can see, most respondents fall in the 20-40 age group.

3) Analyse the total percentage of “yes” responses to belief in crisis. Analyse as a function of career stage and methodological approach

3.1) Belief

Regarding the belief in crisis variable, we directly plot the data.

As we can see, the results are almost evenly divided. Other surveys, such as Baker (2016), report higher percentages of "yes" responses.

Below, we will see how the respondents who said "yes" justify their answers.

3.2) Belief as a function of career stage

Now, we group the belief in crisis by career stage (researchers' education). We plot the data directly, enlarging the plot size.

We don't observe a big difference when analyzing by career stage.
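A grouped comparison like this one can also be inspected numerically as row percentages. The following is a minimal sketch with `pd.crosstab` on toy data; the column names "education" and "belief" are assumptions.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
education_df = pd.DataFrame({
    "education": ["Doctorado", "Doctorado", "Licenciatura", "Licenciatura"],
    "belief": ["Sí", "No", "Sí", "Sí"],
})

# normalize="index" makes each row (career stage) sum to 1; scale to percent.
table = pd.crosstab(
    education_df["education"], education_df["belief"], normalize="index"
) * 100
print(table)
```

Reading the percentages row by row makes it easier to compare groups of different sizes than reading raw counts.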

We can also analyze by position in college.

In this case, ATPs seem slightly more likely to believe in a crisis in science.

3.3) Belief as a function of methodological approach

Let's take a closer look.

Considering methodology, we can see that there are no substantial differences between researchers using mixed and quantitative methodology in their belief in a crisis in science. However, those who use a predominantly qualitative methodology differ slightly: qualitative researchers seem to believe more in a crisis in science.

Likewise, it is worth noting that the results among those using a predominantly quantitative methodology are divided, given that the so-called "replicability crisis" has a lot to do with statistical problems, such as excessive confidence in p-values and null hypothesis testing, and statistical fallacies.

3.4) Belief as a function of age

Regarding respondents' age, there is no clear correlation with belief in a crisis in science.

4) Qualitative analysis of open field response to belief in crisis

Now, let's reuse the make_worldcloud() and extract_text() functions, built previously, to analyse the justifications given for "yes" answers to belief in crisis.

It doesn't look very good.

We add stopwords.

As we can now see, having filtered out vague words, some words are more frequent among those who answered "yes" to the question "Do you believe there is a crisis in science?". These more frequent words are replicability ("replicabilidad"), system ("sistema"), lack ("falta"), and quality ("calidad").

Now let's see how many comments mention the word "replicability" as a cause of the crisis. First, we create a new dataframe with accents and uppercase letters removed from the comments.

We add the comments mentioning "replica" to a list. We use "replica" to catch both "replicacion" and "replicabilidad".
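The substring match described above can be sketched as a simple list comprehension. The comments below are invented toy examples and are assumed to be already lowercased and stripped of accents.

```python
# Toy comments, already normalized (lowercase, no accents).
comments = [
    "falta de replicabilidad",
    "problemas de financiamiento",
    "crisis de replicacion",
]

# "replica" is a common prefix of "replicacion" and "replicabilidad".
mentions = [comment for comment in comments if "replica" in comment]

print(len(mentions))
for comment in mentions:
    print("-", comment)
```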

We print the formatted data.

As we can see, only seven comments mention "replicabilidad" or "replicación" as causes of crisis in science.

5) Percentage of agreement with each, and combined, statements about replication crisis, p-value, publication bias. Analyse as a function of career stage and methodological approach

At this point, let's create tidy tables that allow us to plot easily. Since these data come from a Likert scale, they will be plotted with the HH package in R. The plotting code will be available in the R script.

First, let's create a function that builds a table from a column provided.

Second, we apply the previous function iteratively to create tables from one column index to another.

Third, we combine the tables into a single one.

Finally, we convert numerical values to int, given that by default they are strings.
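The four steps above might be sketched as follows. The helper `make_table`, the column names, and the answer labels are hypothetical. In this toy version the counts are already numeric, so the final `astype(int)` only mirrors the conversion step described in the text.

```python
import pandas as pd

# Toy Likert answers; column names and labels are hypothetical.
df = pd.DataFrame({
    "item_1": ["De acuerdo", "De acuerdo", "En desacuerdo"],
    "item_2": ["En desacuerdo", "De acuerdo", "En desacuerdo"],
})

def make_table(data, column):
    """Return a one-row table with the count of each answer to `column`."""
    counts = data[column].value_counts()
    return counts.to_frame(name=column).T

# Build one table per column in a range of column positions.
tables = [make_table(df, col) for col in df.columns[0:2]]

# Stack the tables into one; answers missing for an item become 0.
combined = pd.concat(tables).fillna(0).astype(int)
print(combined)
```

Each row of `combined` is a survey item and each column an answer option, which is a convenient layout for diverging stacked bar charts of Likert data.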

We create three variables, corresponding to:

1) Answers to "Value each one of the following issues concerning your opinion about science". This variable will be called "science".

2) Answers to "Mark the option that best represents your knowledge and experience with each practice in the last five years". This variable will be named "experience".

3) Answers to "Choose the option that best represents how important you consider each of the following practices to be for improving the quality and efficiency of research in your research area". This variable will be named "efficiency".

Let's see the results.

We sort labels.

We export the tables (to be plotted in R with the HH package).

And we import the resulting plots.

(1) Science:

(2) Experience:

(3) Efficiency:

Higher-quality versions of the plots can be found at https://github.com/francosbenitez/thesis/tree/master/images.

Now, we create and export tables grouped by career stage and methodological approach.

First, we create a function that receives a column and a group and transforms them into a dataframe table.
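A function of this kind could be sketched with `groupby` and `value_counts`. The name `make_grouped_table`, the column names, and the toy values are assumptions rather than the notebook's actual code.

```python
import pandas as pd

def make_grouped_table(data, column, group):
    """Count the answers to `column` separately for each level of `group`."""
    counts = data.groupby(group)[column].value_counts()
    # Move the answer labels to the columns; absent combinations become 0.
    return counts.unstack(fill_value=0)

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "methodology": ["Cuantitativa", "Cuantitativa", "Cualitativa"],
    "item_1": ["De acuerdo", "En desacuerdo", "De acuerdo"],
})
table = make_grouped_table(df, "item_1", "methodology")
print(table)
```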

Then, we create the tables iteratively by index but, unlike before, this time by group.

We create a function to export iteratively.

We export.

6) Percentage of agreement with each, and combined, statements about perceived barriers. Analyse as a function of career stage and methodological approach

As before, we create a new dataframe with the rows expanded on semicolons.

We replace the numbers at the beginning of each row. (We reuse the replace_columns() function defined previously.)

We plot.

Now we plot it again, adding percentages.

8) Qualitative analysis of open field response to attitudes about barriers against adopting open science practices

9) Other analysis

Finally, let's see all the comments on the "Finally, if you have any ideas or comments regarding this survey or the topic it covers, please write them briefly below" item.

We add the mean and standard deviation of the tables.

We transform the count values into percentages.
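Converting counts to percentages can be done by dividing each row by its total. The sketch below assumes a table with one row per item and one column per answer option; the labels and counts are toy values.

```python
import pandas as pd

# Toy count table; labels and values are hypothetical.
counts = pd.DataFrame(
    {"De acuerdo": [3, 1], "En desacuerdo": [1, 3]},
    index=["item_1", "item_2"],
)

# Divide each row by its own total, then scale to percent.
percentages = counts.div(counts.sum(axis=1), axis=0) * 100
print(percentages)
```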